Abstract



Consider a linear $[n,k,d]_q$ code C. We say that the i-th coordinate of C has locality r if
the value at this coordinate can be recovered from accessing some other r coordinates of C.
Data storage applications require codes with small redundancy, low locality for information 
coordinates, large distance, and low locality for parity coordinates. In this paper we carry out 
an in-depth study of the relations between these parameters. 

We establish a tight bound for the redundancy $n - k$ in terms of the message length, the
distance, and the locality of information coordinates. We refer to codes attaining the bound as 
optimal. We prove some structure theorems about optimal codes, which are particularly strong 
for small distances. This gives a fairly complete picture of the tradeoffs between codeword
length, worst-case distance and locality of information symbols.

We then consider the locality of parity check symbols and erasure correction beyond worst 
case distance for optimal codes. Using our structure theorem, we obtain a tight bound for the 
locality of parity symbols possible in such codes for a broad class of parameter settings. We 
prove that there is a tradeoff between having good locality for parity checks and the ability to 
correct erasures beyond the minimum distance. 

1 Introduction 

Modern large scale distributed storage systems such as data centers store data in a redundant form 
to ensure reliability against node (e.g., individual machine) failures. The simplest solution here is 
the straightforward replication of data packets across different nodes. An alternative solution involves
erasure coding: the data is partitioned into k information packets. Subsequently, using an erasure
code, $n - k$ parity packets are generated and all n packets are stored in different nodes.

Using erasure codes instead of replication may lead to dramatic improvements both in terms of
redundancy and reliability. However, to realize these improvements one has to address the challenge
of maintaining an erasure encoded representation. In particular, when a node storing some packet
fails, one has to be able to quickly reconstruct the lost packet in order to keep the data readily
available for the users and to maintain the same level of redundancy in the system. We say that a
certain packet has locality r if it can be recovered from accessing only r other packets. One way to
ensure fast reconstruction is to use erasure codes where all packets have low locality $r \ll k$. Having
a small value of locality is particularly important for information packets.



These considerations lead us to introduce the concept of an (r,d)-code, i.e., a linear code of
distance d, where all information symbols have locality at most r. Storage systems based on
(r,d)-codes provide fast recovery of information packets from a single node failure (the typical scenario),
and ensure that no data is lost even if up to $d - 1$ nodes fail simultaneously. One specific class of
(r,d)-codes called Pyramid Codes has been considered in [5].

Pyramid codes can be obtained from any systematic Maximum Distance Separable (MDS)
code of distance d, such as a Reed-Solomon code. Assume for simplicity that the first parity check
symbol is the sum $\sum_{i=1}^{k} x_i$ of the information symbols. Replace this with $\lceil k/r \rceil$ parity checks each
of size at most r on disjoint information symbols. It is not hard to see that the resulting code C
has information locality r and distance d, while the redundancy of the code C is given by

$$n - k = \left\lceil \frac{k}{r} \right\rceil + d - 2. \qquad (1)$$



1.1 Our results 

In this paper we carry out an in-depth study of the relations between redundancy, erasure-correction 
and symbol locality in linear codes. 

Our first result is a tight bound for the redundancy in terms of the message length, the distance,
and the information locality. We show that in any $[n,k,d]_q$ code of information locality r,

$$n - k \geq \left\lceil \frac{k}{r} \right\rceil + d - 2. \qquad (2)$$

We refer to codes attaining the bound above as optimal. Pyramid codes are one such family of
codes. The bound (2) is of particular interest in the case when $r \mid k$, since otherwise one can improve
the code by increasing the dimension while maintaining the (r,d)-property and redundancy intact.
A closer examination of our lower bound gives a structure theorem for optimal codes when $r \mid k$.
This theorem is especially strong when $d < r + 3$: it fixes the support of the parity check matrix,
and the only freedom is in the choice of coefficients. We also show that the condition $d < r + 3$ is in
fact necessary for such a strong statement to hold.

We then turn our attention to the locality of parity symbols. We prove tight bounds on the 
locality of parity symbols in optimal codes assuming d < r + 3. In particular we establish the 
existence of optimal (r,d)-codes that are significantly better than pyramid codes with respect to
locality of parity symbols. Our codes are explicit in the case of d = 4, and non-explicit otherwise. 
The lower bound is proved using the structure theorem. Finally, we relax the conditions d < r + 3 
and $r \mid k$ and exhibit one specific family of optimal codes that gives locality r for all symbols.

Our last result concerns erasure correction beyond the worst case distance of the code. Assume 
that we are given a bipartite graph which describes the supports of the parity check symbols. 
What choice of coefficients will maximize the set of erasure patterns that can be corrected by such 
a code? In [5] the authors gave a necessary condition for an erasure pattern to be correctable, and 
showed that over sufficiently large fields, this condition is also sufficient. They called such codes 
Generalized Pyramid codes. We show that such codes cannot have any non-trivial parity locality; 
thus establishing a tradeoff between parity locality and erasure correction beyond the worst case 
distance. 
1.2 Related work



There are two classes of erasure codes providing fast recovery procedures for individual codeword 
coordinates (packets) in the literature. 

Regenerating codes. These codes were introduced in [2] and developed further in e.g., [U [l]. 
See [3] for a survey. One crucial idea behind regenerating codes is that of sub-packetization. Each
packet is composed of a few sub-packets, and when a node storing a packet fails all (or most of)
other nodes send in some of their sub-packets for recovery. Efficiency of the recovery procedure 
is measured in terms of the overall bandwidth consumption, i.e., the total size of sub-packets
required to recover from a single failure. Somewhat surprisingly, regenerating codes can in many
cases achieve a rather significant reduction in bandwidth, compared with codes that do not employ
sub-packetization. Our experience with data centers, however, suggests that in practice there is a
considerable overhead related to accessing extra storage nodes. Therefore pure bandwidth
consumption is not necessarily the right single measure of the recovery time. In particular, coding
solutions that do not rely on sub-packetization and thus access fewer nodes (but download more
data) are sometimes more attractive.

Locally decodable codes. These codes were introduced in [6] and developed further in e.g., \10\ 
m E]. See [H] for a survey. An r-query Locally Decodable Code (LDC) encodes messages in 
such a way that one can recover any message symbol by accessing only r codeword symbols even 
after some arbitrarily chosen (say) 10% of codeword coordinates are erased. Thus LDCs are in
fact very similar to the (r,d)-codes addressed in the current paper, with an important distinction that
LDCs allow for local recovery even after a very large number of symbols is erased, while
(r,d)-codes provide locality only after a single erasure. Not surprisingly, locally decodable codes require
substantially larger codeword lengths than (r,d)-codes.

1.3 Organization 

In section 3 we establish the lower bound for redundancy of (r,d)-codes and obtain a structural
characterization of optimal codes, i.e., codes attaining the bound. In section 4 we strengthen the
structural characterization for optimal codes with $d < r + 3$ and show that any such code has to be
a canonical code. In section 5 we prove matching lower and upper bounds on the locality of parity
symbols in canonical codes. Our code construction is not explicit and requires the underlying field
to be fairly large. In the special case of codes of distance d = 4, we come up with an explicit
family that does not need a large field. In section 6 we present one optimal family of non-canonical
codes that gives uniform locality for all codeword symbols. Finally, in section 7 we study erasure
correction beyond the worst case distance and prove that systematic codes correcting the maximal
number of erasure patterns (conditioned on the support structure of the generator matrix) cannot
have any non-trivial locality for parity symbols.

2 Preliminaries 

We use standard mathematical notation:

• For an integer t, $[t] = \{1, \ldots, t\}$;

• For a vector x, Supp(x) denotes the set $\{i : x_i \neq 0\}$;

• For a vector x, wt(x) = |Supp(x)| denotes the Hamming weight;

• For a vector x and an integer i, x(i) denotes the i-th coordinate of x;

• For sets A and B, $A \sqcup B$ denotes the disjoint union.

Let C be an $[n,k,d]_q$ linear code. Assume that the encoding of $x \in \mathbb{F}_q^k$ is by the vector

$$C(x) = (c_1 \cdot x, c_2 \cdot x, \ldots, c_n \cdot x) \in \mathbb{F}_q^n. \qquad (3)$$

Thus the code C is specified by the set of n points $C = \{c_1, \ldots, c_n\} \subseteq \mathbb{F}_q^k$. The set of points must
have full rank for C to have k information symbols. It is well known that the distance property is
captured by the following condition (e.g., [9, theorem 1.1.6]).

Fact 1. The code C has distance d if and only if for every $S \subseteq C$ such that $\mathrm{Rank}(S) \leq k - 1$,

$$|S| \leq n - d. \qquad (4)$$

In other words, every hyperplane through the origin misses at least d points from C. In this 
work, we are interested in the recovery cost of each symbol in the code from a single erasure. 
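As a sanity check of Fact 1, the following sketch computes the distance of a small code in two ways: directly as the minimum codeword weight, and as n minus the size of the largest point set of rank at most $k - 1$. The choice of the binary [7,4,3] Hamming code and the helper names `rank_mod_p`, `distance_by_weight` and `distance_by_fact1` are assumptions made for illustration only; the sketch is restricted to prime fields and to codes small enough for brute force.

```python
from itertools import combinations, product

def rank_mod_p(vectors, p):
    """Rank of a list of vectors over the prime field F_p (Gaussian elimination)."""
    rows = [list(v) for v in vectors]
    rank = 0
    for col in range(len(rows[0]) if rows else 0):
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col] % p), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        inv = pow(rows[rank][col], -1, p)
        rows[rank] = [x * inv % p for x in rows[rank]]
        for i in range(len(rows)):
            if i != rank and rows[i][col] % p:
                f = rows[i][col]
                rows[i] = [(a - f * b) % p for a, b in zip(rows[i], rows[rank])]
        rank += 1
    return rank

def distance_by_weight(points, p):
    """Minimum Hamming weight of C(x) = (c_1.x, ..., c_n.x) over non-zero x."""
    k = len(points[0])
    return min(
        sum(1 for c in points if sum(a * b for a, b in zip(c, x)) % p)
        for x in product(range(p), repeat=k) if any(x)
    )

def distance_by_fact1(points, p):
    """n minus the size of the largest S with Rank(S) <= k - 1, as in Fact 1."""
    n, k = len(points), len(points[0])
    largest = max(
        len(S)
        for size in range(n + 1)
        for S in combinations(points, size)
        if rank_mod_p(S, p) <= k - 1
    )
    return n - largest

# Points c_1..c_7 of a systematic binary [7,4,3] Hamming code (illustrative choice).
HAMMING = [
    (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1),
    (1, 1, 0, 1), (1, 0, 1, 1), (0, 1, 1, 1),
]
```

Both functions return 3 on `HAMMING` with `p = 2`: the largest set of points lying on a common hyperplane has size 4 = n - d.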

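Definition 2. For $c_i \in C$, we define $\mathrm{Loc}(c_i)$ to be the smallest integer r for which there exists
$R \subseteq C \setminus \{c_i\}$ of cardinality r such that

$$c_i = \sum_{j \in R} \lambda_j c_j.$$

We further define $\mathrm{Loc}(C) = \max_{i \in [n]} \mathrm{Loc}(c_i)$.

Note that $\mathrm{Loc}(c_i) \leq k$, provided $d \geq 2$, since this guarantees that $C \setminus \{c_i\}$ has full dimension.
We will be interested in (systematic) codes which guarantee locality for the information symbols.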
We will be interested in (systematic) codes which guarantee locality for the information symbols. 

Definition 3. We say that a code C has information locality r if there exists $I \subseteq C$ of full rank
such that $\mathrm{Loc}(c) \leq r$ for all $c \in I$.

For such a code we can choose I as our basis for $\mathbb{F}_q^k$ and partition C into $I = \{e_1, \ldots, e_k\}$
corresponding to information symbols and $C \setminus I = \{c_{k+1}, \ldots, c_n\}$ corresponding to parity check
symbols. Thus the code C can be made systematic.

Definition 4. A code C is an (r,d)-code if it has information locality r and distance d.

For any code C, the set of all linear dependencies of length at most r + 1 on points in C defines
a natural hypergraph $H_r(V, E)$ whose vertex set V = [n] is in one-to-one correspondence to points
in C. There is an edge corresponding to a set $S \subseteq V$ if $|S| \leq r + 1$ and

$$\sum_{i \in S} \lambda_i c_i = 0 \quad \text{for some non-zero coefficients } \lambda_i.$$

Equivalently, $S \subseteq [n]$ is an edge in H if it supports a codeword in $C^{\perp}$ of weight at most r + 1. Since
r will usually be clear from the context, we will just say H(V,E). A code C has locality r if there
are no isolated vertices in H. A code C has information locality r if the set of points corresponding
to vertices that are incident to some edge in H has full rank.






We conclude this section by presenting one specific class of (r,d)-codes that has been considered in [5]:
Pyramid codes. In what follows the dot product of vectors p and x is denoted by $p \cdot x$. To define
an (r,d) pyramid code C encoding messages of dimension k we fix an arbitrary linear systematic
$[k + d - 1, k, d]_q$ code $\mathcal{E}$. Clearly, $\mathcal{E}$ is MDS. Let

$$\mathcal{E}(x) = (x, p_0 \cdot x, p_1 \cdot x, \ldots, p_{d-2} \cdot x).$$

We partition the set [k] into $t = \lceil k/r \rceil$ subsets of size up to r, $[k] = \bigsqcup_{i \in [t]} S_i$. For a k-dimensional
vector x and a set $S \subseteq [k]$ let $x|_S$ denote the |S|-dimensional restriction of x to coordinates in the
set S. We define the systematic code C by

$$C(x) = \left(x, (p_0|_{S_1} \cdot x|_{S_1}), \ldots, (p_0|_{S_t} \cdot x|_{S_t}), p_1 \cdot x, \ldots, p_{d-2} \cdot x\right).$$

It is not hard to see that the code C has distance d. To see that all information symbols and the
first $\lceil k/r \rceil$ parity symbols of C have locality r one needs to observe that (since $\mathcal{E}$ is an MDS code)
the vector $p_0$ has full Hamming weight. The last $d - 2$ parity symbols of C may have locality as
large as k.
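The pyramid construction can be tried out on a toy instance. The sketch below assumes the prime field $\mathbb{F}_7$, the parameters k = 4, r = 2, d = 3, and the parity vectors $p_0 = (1,1,1,1)$, $p_1 = (1,2,3,4)$ of a $[6,4,3]_7$ MDS code; these concrete values and all function names are illustrative choices, not taken from [5]. Distance and locality are found by brute force.

```python
from itertools import combinations, product

P = 7  # prime field size, assumed for this example

def rank_mod_p(vectors, p=P):
    """Rank of a list of vectors over the prime field F_p (Gaussian elimination)."""
    rows = [list(v) for v in vectors]
    rank = 0
    for col in range(len(rows[0]) if rows else 0):
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col] % p), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        inv = pow(rows[rank][col], -1, p)
        rows[rank] = [x * inv % p for x in rows[rank]]
        for i in range(len(rows)):
            if i != rank and rows[i][col] % p:
                f = rows[i][col]
                rows[i] = [(a - f * b) % p for a, b in zip(rows[i], rows[rank])]
        rank += 1
    return rank

def pyramid_points(k, r, parities):
    """Points c_1..c_n in F_p^k of the pyramid code obtained by splitting p_0."""
    points = [[int(i == j) for j in range(k)] for i in range(k)]   # information symbols
    p0 = parities[0]
    for start in range(0, k, r):                                    # groups S_1, ..., S_t
        points.append([p0[j] if start <= j < start + r else 0 for j in range(k)])
    points.extend(list(p) for p in parities[1:])                    # p_1, ..., p_{d-2}
    return points

def min_distance(points, p=P):
    """Minimum weight of (c_1.x, ..., c_n.x) over all non-zero messages x."""
    k = len(points[0])
    return min(
        sum(1 for c in points if sum(a * b for a, b in zip(c, x)) % p)
        for x in product(range(p), repeat=k) if any(x)
    )

def locality(points, i, p=P):
    """Smallest number of other points whose span contains points[i]."""
    others = points[:i] + points[i + 1:]
    for size in range(1, len(others) + 1):
        for R in combinations(others, size):
            if rank_mod_p(list(R) + [points[i]], p) == rank_mod_p(R, p):
                return size
    return None

POINTS = pyramid_points(4, 2, [(1, 1, 1, 1), (1, 2, 3, 4)])
DISTANCE = min_distance(POINTS)
LOCALITIES = [locality(POINTS, i) for i in range(len(POINTS))]
```

Under these assumptions the code has length 7 and distance 3, the four information symbols and the two split parities have locality 2, and the remaining parity has locality 4 = k.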



3 Lower Bound and the Structure Theorem 

We are interested in systematic codes with information locality r. Given k, r, d our goal is to
minimize the codeword length n. Since the code is systematic, this amounts to minimizing the
redundancy $h = n - k$. Pyramid codes have $h = \lceil k/r \rceil + d - 2$. Our goal is to prove a matching
lower bound. Lower bounds of k/r and $d - 1$ are easy to show, just from the locality and distance
constraints respectively. The hard part is to sum them up.

Theorem 5. For any $[n,k,d]_q$ linear code with information locality r,

$$n - k \geq \left\lceil \frac{k}{r} \right\rceil + d - 2. \qquad (5)$$



Proof. Our lower bound proceeds by constructing a large set $S \subseteq C$ where $\mathrm{Rank}(S) \leq k - 1$ and
then applying Fact 1. The set S is constructed by the following algorithm:

1. Let i = 1, $S_0 = \{\}$.
2. While $\mathrm{Rank}(S_{i-1}) \leq k - 2$:
3.     Pick $c_i \in C \setminus S_{i-1}$ such that there is a hyperedge $T_i$ in H containing $c_i$.
4.     If $\mathrm{Rank}(S_{i-1} \cup T_i) < k$, set $S_i = S_{i-1} \cup T_i$.
5.     Else pick $T' \subset T_i$ so that $\mathrm{Rank}(S_{i-1} \cup T') = k - 1$ and set $S_i = S_{i-1} \cup T'$.
6.     Increment i.
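The six steps above can be sketched in code as follows. This is a brute-force illustration under stated assumptions: the field is a prime field, hyperedges are enumerated exhaustively, step 5 is realized by adding points of $T_i$ one at a time while the rank stays at most $k - 1$, and the example points are those of a toy pyramid code with k = 4, r = 2, d = 3 over $\mathbb{F}_7$. None of the function names come from the paper.

```python
from itertools import combinations, product

def rank_mod_p(vectors, p):
    """Rank of a list of vectors over the prime field F_p (Gaussian elimination)."""
    rows = [list(v) for v in vectors]
    rank = 0
    for col in range(len(rows[0]) if rows else 0):
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col] % p), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        inv = pow(rows[rank][col], -1, p)
        rows[rank] = [x * inv % p for x in rows[rank]]
        for i in range(len(rows)):
            if i != rank and rows[i][col] % p:
                f = rows[i][col]
                rows[i] = [(a - f * b) % p for a, b in zip(rows[i], rows[rank])]
        rank += 1
    return rank

def hyperedges(points, r, p):
    """Index sets of size 2..r+1 carrying a dependency with all coefficients non-zero."""
    k = len(points[0])
    edges = []
    for size in range(2, r + 2):
        for T in combinations(range(len(points)), size):
            if any(
                all(sum(l * points[i][j] for l, i in zip(lam, T)) % p == 0 for j in range(k))
                for lam in product(range(1, p), repeat=size)
            ):
                edges.append(T)
    return edges

def grow_set(points, r, p):
    """Steps 1-6 of the construction; returns an index set S with Rank(S) = k - 1."""
    k = len(points[0])
    edges = hyperedges(points, r, p)
    rank = lambda idx: rank_mod_p([points[i] for i in idx], p)
    S = set()
    while rank(S) <= k - 2:
        T = next(T for T in edges if not set(T) <= S)   # step 3: edge with a point outside S
        if rank(S | set(T)) < k:                          # step 4: take the whole edge
            S |= set(T)
        else:                                             # step 5: stop at rank k - 1
            for i in T:
                if rank(S | {i}) <= k - 1:
                    S.add(i)
    return S

# Toy pyramid code over F_7: e_1..e_4, two local parities, one global parity.
POINTS = [
    (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1),
    (1, 1, 0, 0), (0, 0, 1, 1), (1, 2, 3, 4),
]
S = grow_set(POINTS, r=2, p=7)
```

On this example the procedure ends with four points of rank 3, which matches $n - d = 7 - 3$ and shows that the bound is attained.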





In Line 3, since $\mathrm{Rank}(S_{i-1}) \leq k - 2$ and Rank(I) = k, there exists $c_i$ as desired. Let $\ell$ denote the
number of times the set $S_i$ is grown. Observe that the final set $S_\ell$ has $\mathrm{Rank}(S_\ell) = k - 1$. We now
lower bound $|S_\ell|$. We define $s_i, t_i$ to measure the increase in the size and rank of $S_i$ respectively:

$$s_i = |S_i| - |S_{i-1}|, \qquad |S_\ell| = \sum_{i=1}^{\ell} s_i,$$

$$t_i = \mathrm{Rank}(S_i) - \mathrm{Rank}(S_{i-1}), \qquad \mathrm{Rank}(S_\ell) = \sum_{i=1}^{\ell} t_i = k - 1.$$



We analyze two cases, depending on whether the condition $\mathrm{Rank}(S_{i-1} \cup T_i) = k$ is ever reached.
Observe that this condition can only be reached when $i = \ell$.

Case 1: Assume $\mathrm{Rank}(S_{i-1} \cup T_i) \leq k - 1$ throughout. In each step we add $s_i \leq r + 1$ vectors.
Note that these vectors are always such that some nontrivial linear combination of them yields a
(possibly zero) vector in $\mathrm{Span}(S_{i-1})$. Therefore we have $t_i \leq s_i - 1 \leq r$. So there are $\ell \geq \lceil (k-1)/r \rceil$
steps in all. Thus

$$|S| = \sum_{i=1}^{\ell} s_i \geq \sum_{i=1}^{\ell} (t_i + 1) \geq k - 1 + \left\lceil \frac{k-1}{r} \right\rceil. \qquad (6)$$

Note that $k - 1 + \lceil (k-1)/r \rceil \geq k + \lceil k/r \rceil - 2$ with equality holding whenever r = 1 or $k \equiv 1 \bmod r$.

Case 2: In the last step, we hit the condition $\mathrm{Rank}(S_{\ell-1} \cup T_\ell) = k$. Since the rank increases
by at most r per step, $\ell \geq \lceil k/r \rceil$. For $i \leq \ell - 1$, we add a set $T_i$ of $s_i \leq r + 1$ vectors. Again
note that these vectors are always such that some nontrivial linear combination of them yields a
(possibly zero) vector in $\mathrm{Span}(S_{i-1})$. Therefore $\mathrm{Rank}(S_i)$ grows by $t_i$ where $t_i \leq s_i - 1$. In Step $\ell$,
we add $T' \subset T_\ell$ to S. This increases Rank(S) by $t_\ell \geq 1$ (since $\mathrm{Rank}(S) \leq k - 2$ at the start) and
|S| by $s_\ell \geq t_\ell$. Thus

$$|S| = \sum_{i=1}^{\ell} s_i \geq \sum_{i=1}^{\ell-1} (t_i + 1) + t_\ell = k + \ell - 2 \geq k + \left\lceil \frac{k}{r} \right\rceil - 2. \qquad (7)$$

The conclusion now follows from Fact 1 which implies that $|S| \leq n - d$. □
Definition 6. We say that an (r,d)-code C is optimal if its parameters satisfy (5) with equality.

Pyramid codes [5] yield optimal (r,d)-codes for all values of r, d, and k when the alphabet size q is
sufficiently large.
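For concrete parameters the bound and the optimality test reduce to a line of integer arithmetic. The helper names below are hypothetical and serve only as a small worked illustration.

```python
def min_redundancy(k, r, d):
    """Right-hand side of the bound n - k >= ceil(k / r) + d - 2."""
    return -(-k // r) + d - 2  # -(-k // r) is an integer ceiling of k / r

def is_optimal(n, k, r, d):
    """True when an (r, d)-code of length n and dimension k meets the bound with equality."""
    return n - k == min_redundancy(k, r, d)

# A pyramid code with k = 4, r = 2, d = 3 has n = 4 + 2 + 1 = 7 symbols.
EXAMPLE_IS_OPTIMAL = is_optimal(7, 4, 2, 3)
```

Compared with an MDS code of the same distance, whose redundancy is $d - 1$, the price of information locality r is therefore $\lceil k/r \rceil - 1$ extra parity symbols.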

The proof of theorem 5 reveals information about the structure of optimal (r,d)-codes. We
think of the algorithm as attempting to maximize

$$\frac{|S|}{\mathrm{Rank}(S)} = \frac{\sum_{i=1}^{\ell} s_i}{\sum_{i=1}^{\ell} t_i}.$$

With this in mind, at step i we can choose $c_i$ such that $s_i / t_i$ is maximized. An optimal length
code should yield the same value for |S| for this (or any) choice of $c_i$. This observation yields an
insight into the structure of local dependencies in optimal codes, as given by the following structure
theorem.

Theorem 7. Let C be an $[n,k,d]_q$ code with information locality r. Suppose $r \mid k$, $r < k$, and

$$n = k + \frac{k}{r} + d - 2; \qquad (8)$$

then hyperedges in the hypergraph H(V,E) are disjoint and each has size exactly r + 1.

Proof. We execute the algorithm presented in the proof of theorem 5 to obtain a set S and sequences
$\{s_i\}$ and $\{t_i\}$. We consider the case of r = 1 separately. Since all $t_i \leq 1$ we fall into Case 1.
Combining formulas (4), (6) and (8) we get

$$|S| = \sum_{i=1}^{\ell} s_i = \sum_{i=1}^{\ell} t_i + \ell = 2k - 2.$$

Combining this with $\sum_{i=1}^{\ell} t_i = k - 1$ we conclude that $\ell = k - 1$, all $s_i$ equal 2, and all $t_i$ equal 1.
The latter two conditions preclude the existence of hyperedges of size 1 or intersecting edges in H.

We now proceed to the case of r > 1. When $r \mid k$, the bound in equation (6) is larger than that
in equation (7). Thus, we must be in Case 2. Combining formulas (7), (4) and (8) we get

$$|S| = \sum_{i=1}^{\ell} s_i = \sum_{i=1}^{\ell} t_i + \ell - 1 = k + \frac{k}{r} - 2.$$

Observe that $\sum_{i=1}^{\ell} t_i = k - 1$ and thus $\ell = \frac{k}{r}$. Together with the constraint $t_i \leq r$, this implies that
$t_j = r - 1$ for some $j \in [\ell]$ and $t_i = r$ for $i \neq j$. We claim that in fact $j = \ell$. Indeed, if $j < \ell$, we
would have $\sum_{i=1}^{\ell-1} t_i = k - r - 1$ and $t_\ell = r$, hence we would be in Case 1.

Now assume that there is an edge T with $|T| \leq r$. By adding this edge to S at the first
step, we would get $t_1 \leq r - 1$. Next assume that $T_1 \cap T_2$ is non-empty. Observe that this implies
$\mathrm{Rank}(T_1 \cup T_2) < 2r$. So if we add edges $T_1$ and $T_2$ to S, we have $t_1 + t_2 \leq 2r - 1$. Clearly these
conditions lead to a contradiction if $\ell = \frac{k}{r} \geq 3$. In fact, they also give a contradiction for $\ell = 2$, since
they put us in Case 1. □

4 Canonical Codes 

The structure theorem implies that when d is sufficiently small (which in our experience is the
setting of interest in most data storage applications), optimal (r,d)-codes have a rather rigid structure.
We formalize this by defining the notion of a canonical code.

Definition 8. Let C be a systematic $[n,k,d]_q$ code with information locality r where $r \mid k$, $r < k$,
and $n = k + \frac{k}{r} + d - 2$. We say that C is canonical if the set C can be partitioned into three groups
$C = I \cup C' \cup C''$ such that:

1. Points $I = \{e_1, \ldots, e_k\}$.

2. Points $C' = \{c'_1, \ldots, c'_{k/r}\}$ where $\mathrm{wt}(c'_i) = r$. The supports of these vectors are disjoint sets
which partition [k].

3. Points $C'' = \{c''_1, \ldots, c''_{d-2}\}$ where $\mathrm{wt}(c''_i) = k$.

Clearly any canonical code is systematic and has information locality r. The distance property
requires a suitable choice of vectors $\{c'\}$ and $\{c''\}$. Pyramid codes [5] are an example of canonical
codes. We note that since r < k, there is always a distinction between symbols $\{c'\}$ and $\{c''\}$.

Theorem 9. Assume that $d < r + 3$, $r < k$, and $r \mid k$. Let $n = k + \frac{k}{r} + d - 2$. Every systematic
$[n,k,d]_q$ code with information locality r is a canonical code.

Proof. Let C be a systematic $[n,k,d]_q$ code with information locality r. We start by showing that
the hypergraph H(V,E) has $\frac{k}{r}$ edges.

Since C is systematic, we know that $I = \{e_1, \ldots, e_k\} \subseteq C$. By theorem 7, H(V,E) consists of
m disjoint, (r + 1)-regular edges and every vertex in I appears in some edge. But since the points
in I are linearly independent, every edge involves at least one vertex from $C \setminus I$ and at most r
from I. So we have $m \geq \frac{k}{r}$. We show that equality holds.

Assume for contradiction that $m \geq \frac{k}{r} + 1$. Since the edges are regular and disjoint, we have

$$n \geq m(r + 1) \geq k + \frac{k}{r} + r + 1 > k + \frac{k}{r} + d - 2$$

which contradicts the choice of n. Thus $m = \frac{k}{r}$. This means that every edge $T_j$ is incident on
exactly r vertices $e_{i_1}, \ldots, e_{i_r}$ from I and one vertex $c'_j$ outside it. Hence

$$c'_j = \sum_{s=1}^{r} \lambda_s e_{i_s}, \qquad \lambda_s \neq 0.$$

Since the $T_j$'s are disjoint, the vectors $c'_1, \ldots, c'_{k/r}$ have disjoint supports which partition [k].

We now show that the remaining vectors $c''_1, \ldots, c''_{d-2}$ must all have $\mathrm{wt}(c'') = k$. For this, we
consider the encoding of $e_j$. We note $e_i \cdot e_j \neq 0$ iff i = j and $c'_i \cdot e_j \neq 0$ iff $j \in \mathrm{Supp}(c'_i)$. Thus only
2 of these inner products are non-zero. Since the code has distance d, all the $d - 2$ inner products
$c''_i \cdot e_j$ are non-zero. This shows that $\mathrm{wt}(c''_i) = k$ for all i. □

The above bound is strong enough to separate having locality of r from having just information
locality of r. The following corollary follows from the observation that the hypergraph H(V,E)
must contain $n - \frac{k}{r}(r + 1) = d - 2$ isolated vertices, which do not participate in any linear relations
of size r + 1.

Corollary 10. Assume that $2 < d < r + 3$ and $r \mid k$. Let $n = k + \frac{k}{r} + d - 2$. There are no $[n,k,d]_q$
linear codes with locality r.



5 Canonical codes: parity locality 

Theorem 9 gives a very good understanding of optimal (r,d)-codes in the case $d < r + 3$ and $r \mid k$.
For any such code the coordinate set $C = \{c_i\}_{i \in [n]}$ can be partitioned into sets $I, C', C''$ where for
all $c \in I \cup C'$, Loc(c) = r, and for all $c'' \in C''$, $\mathrm{Loc}(c'') > r$. It is natural to ask how low the
locality of symbols $c'' \in C''$ can be. In this section we address and resolve this question.

5.1 Parity locality lower bound 

We begin with a lower bound. 

Theorem 11. Let C be a systematic optimal (r,d)-code with parameters $[n,k,d]_q$. Suppose $d < r + 3$,
$r < k$, and $r \mid k$. Then some $\frac{k}{r}$ parity symbols of C have locality exactly r, and $d - 2$ other parity
symbols of C have locality no less than

$$k - \left(\frac{k}{r} - 1\right)(d - 3). \qquad (9)$$

Proof. Theorem 9 implies that C is a canonical code. Let $C = I \cup C' \cup C''$ be the canonical partition
of the coordinates of C. Clearly, for all $\frac{k}{r}$ symbols $c' \in C'$ we have $\mathrm{Loc}(c') \leq r$. We now prove lower
bounds on the locality of symbols in $C' \cup C''$.

We start with symbols $c'' \in C''$. For every $j \in [k/r]$ we define a subset $R_j \subseteq C$ that we call a
row. Let $S_j = \mathrm{Supp}(c'_j)$. The j-th row contains the vector $c'_j$, all r unit vectors in the support of
$c'_j$ and the set $C''$:

$$R_j = \{c'_j\} \cup \Big( \bigcup_{i \in S_j} \{e_i\} \Big) \cup C''.$$

Observe that restricted to $I \cup C'$ rows $\{R_j\}_{j \in [k/r]}$ form a partition. Consider an arbitrary symbol
$c'' \in C''$. Let $\ell = \mathrm{Loc}(c'')$. We have

$$c'' = \sum_{c \in L} \lambda_c c, \qquad (10)$$

where $|L| = \ell$. In what follows we show that for each row $R_j$,

$$|R_j \cap L| \geq r \qquad (11)$$

needs to hold. It is not hard to see that this together with the structure of the sets $\{R_j\}$ implies
inequality (9). To prove (11) we consider the code

$$C_j = \{C(x) \mid x \in \mathbb{F}_q^k \text{ such that } \mathrm{Supp}(x) \subseteq S_j\}. \qquad (12)$$

It is not hard to see that $\mathrm{Supp}(C_j) = R_j$ and $\dim C_j = r$. Observing that the distance of the code
$C_j$ is at least d and $|R_j| = r + d - 1$ we conclude that (restricted to its support) $C_j$ is an MDS code.
Thus any r symbols of $C_j$ are independent. It remains to note that (10) restricted to coordinates
in $S_j$ yields a non-trivial dependency of length at most $|R_j \cap L| + 1$ between the symbols of $C_j$.

We proceed to the lower bound on the locality of symbols in $C'$. Fix an arbitrary $c'_j \in C'$. A
reasoning similar to the one above implies that if $\mathrm{Loc}(c'_j) < r$, then there is a dependency of length
below r + 1 between the coordinates of the $[r + d - 1, r, d]_q$ code $C_j$ (defined by (12)) restricted to
its support. □
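The counting step that leads from (11) to (9) is only asserted in the proof. One way to spell it out is sketched below; the auxiliary symbol $L''$ is introduced here for convenience and does not appear in the original argument.

```latex
% Let L'' := L \cap C''. Since c'' itself is not in L, |L''| \le |C''| - 1 = d - 3.
\begin{align*}
|R_j \cap L \cap (I \cup C')| &\ge r - |L''|
    && \text{by (11), as } R_j \cap L \subseteq \big(R_j \cap (I \cup C')\big) \sqcup L'', \\
|L| &= |L''| + \sum_{j \in [k/r]} |R_j \cap L \cap (I \cup C')|
    && \text{(the rows partition } I \cup C'\text{)} \\
    &\ge |L''| + \frac{k}{r}\,\big(r - |L''|\big) \\
    &= k - \Big(\frac{k}{r} - 1\Big)\,|L''| \\
    &\ge k - \Big(\frac{k}{r} - 1\Big)(d - 3).
\end{align*}
```

The last step uses $k/r - 1 \geq 0$ together with $|L''| \leq d - 3$.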

Observe that the bound (9) is close to k only when r is large and d is small. In other cases
theorem 11 does not rule out existence of canonical codes with low locality for all symbols (including
those in $C''$). In the next section we show that such codes indeed exist. In particular we show that
the bound (9) can always be met with equality.
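A few sample values make the dependence of the bound (9) on r and d concrete. The function name and the chosen parameter triples are illustrative assumptions.

```python
def parity_locality_bound(k, r, d):
    """Bound (9): k - (k/r - 1)(d - 3), under the hypotheses r | k, r < k, d < r + 3."""
    if k % r != 0 or not r < k or not d < r + 3:
        raise ValueError("parameters outside the range covered by the bound")
    return k - (k // r - 1) * (d - 3)

# (k, r, d) -> lower bound on the locality of the d - 2 "heavy" parity symbols
SAMPLES = {
    (4, 2, 3): parity_locality_bound(4, 2, 3),
    (12, 6, 4): parity_locality_bound(12, 6, 4),
    (12, 3, 5): parity_locality_bound(12, 3, 5),
}
```

With large r and small d, as in (12, 6, 4), the bound stays at 11, close to k = 12; with small r and larger d, as in (12, 3, 5), it drops to 6.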



5.2 Parity locality upper bounds 

Our main results in this section are given by theorems 15 and 16. Theorem 15 gives a general
upper bound matching the lower bound of theorem 11. The proof is not explicit. Theorem 16 gives
an explicit family of codes in the narrow case of d = 4. We start by introducing some concepts we
need for the proof of theorem 15.

Definition 12. Let $L \subseteq \mathbb{F}_q^n$ be a linear space and $S \subseteq [n]$ be a set, |S| = k. We say that S is a
k-core for L if for all non-zero vectors $v \in L$, $\mathrm{Supp}(v) \not\subseteq S$.

It is not hard to verify that S is a k-core for L if and only if S is a subset of some set of
information coordinates in the space $L^{\perp}$. In other words S is a k-core for L if and only if the k columns
in the $(n - \dim L)$-by-n generator matrix of $L^{\perp}$ that correspond to elements of S are linearly
independent.
Definition 13. Let $L \subseteq \mathbb{F}_q^n$ be a linear space. Let $\{c_1, \ldots, c_n\}$ be a sequence of n vectors in $\mathbb{F}_q^k$.
We say that vectors $\{c_i\}$ are in general position subject to L if the following conditions hold:

1. For all vectors $v \in L$ we have $\sum_{i=1}^{n} v(i) c_i = 0$;

2. For all k-cores S of L we have $\mathrm{Rank}(\{c_i\}_{i \in S}) = k$.

The next lemma asserts existence of vectors that are in general position subject to an arbitrary 
linear space provided the underlying field is large enough. 

Lemma 14. Let $L \subseteq \mathbb{F}_q^n$ be a linear space and k be a positive integer. Suppose $q > kn^k$; then there
exists a family of vectors $\{c_i\}_{i \in [n]}$ in $\mathbb{F}_q^k$ that are in general position subject to L.

Proof. We obtain a matrix $M \in \mathbb{F}_q^{k \times n}$ picking the rows of M at random (uniformly and
independently) from the linear space $L^{\perp}$. We choose vectors $\{c_i\}$ to be the columns of M. Observe that
the first condition in definition 13 is always satisfied. Further observe that our choice of M induces
a uniform distribution on every set of k columns of M that form a k-core. The second condition in
definition 13 is satisfied as long as all k-by-k minors of M that correspond to k-cores are invertible.
This happens with probability at least

$$1 - \binom{n}{k}\left(1 - \prod_{i=1}^{k}\left(1 - \frac{1}{q^i}\right)\right) \geq 1 - \binom{n}{k}\left(1 - \left(1 - \frac{1}{q}\right)^{k}\right) \geq 1 - \binom{n}{k}\frac{k}{q} > 0.$$

This concludes the proof. □
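The random choice in the proof can be imitated numerically. The sketch below makes several assumptions that are not part of the lemma: the field is the prime field $\mathbb{F}_{101}$, which is far smaller than the sufficient condition $q > kn^k$, so the sampling is simply retried until it succeeds; the space L is spanned by the indicator vectors of disjoint blocks that partition [n]; and all function names are hypothetical.

```python
import random
from itertools import combinations

def rank_mod_p(vectors, p):
    """Rank of a list of vectors over the prime field F_p (Gaussian elimination)."""
    rows = [list(v) for v in vectors]
    rank = 0
    for col in range(len(rows[0]) if rows else 0):
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col] % p), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        inv = pow(rows[rank][col], -1, p)
        rows[rank] = [x * inv % p for x in rows[rank]]
        for i in range(len(rows)):
            if i != rank and rows[i][col] % p:
                f = rows[i][col]
                rows[i] = [(a - f * b) % p for a, b in zip(rows[i], rows[rank])]
        rank += 1
    return rank

def columns(matrix, S):
    """The columns of `matrix` indexed by S, each returned as a vector."""
    return [[row[j] for row in matrix] for j in S]

def general_position(blocks, n, k, p, rng, tries=200):
    """Sample c_1..c_n in F_p^k in general position subject to L = span of block indicators.

    Assumes the blocks partition range(n)."""
    # Basis of the dual space: vectors whose entries sum to zero on every block.
    basis = []
    for blk in blocks:
        for a in blk[:-1]:
            v = [0] * n
            v[a], v[blk[-1]] = 1, p - 1
            basis.append(v)
    # S is a k-core iff the matching columns of a generator matrix of the dual are independent.
    cores = [S for S in combinations(range(n), k) if rank_mod_p(columns(basis, S), p) == k]
    for _ in range(tries):
        coeff = [[rng.randrange(p) for _ in basis] for _ in range(k)]
        M = [[sum(c * b[j] for c, b in zip(row, basis)) % p for j in range(n)] for row in coeff]
        if all(rank_mod_p(columns(M, S), p) == k for S in cores):
            return columns(M, range(n)), cores
    raise RuntimeError("no sample in general position found; try a larger field")

BLOCKS = [(0, 1, 2), (3, 4, 5)]
POINTS, CORES = general_position(BLOCKS, n=6, k=3, p=101, rng=random.Random(0))
```

In this example the k-cores are the 3-subsets of the six coordinates that do not contain a whole block. Each of them is singular with probability roughly 1/101, so a single draw already succeeds most of the time.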
We proceed to the main result of this section. 

Theorem 15. Let $2 < d < r + 3$, $r < k$, $r \mid k$. Let $q > kn^k$ be a prime power. Let $n = k + \frac{k}{r} + d - 2$.
There exists a systematic $[n,k,d]_q$ code C of information locality r, where $\frac{k}{r}$ parity symbols have
locality r, and $d - 2$ other parity symbols have locality $k - \left(\frac{k}{r} - 1\right)(d - 3)$.

Proof. Let $t = \frac{k}{r}$. Fix some t + 1 subsets $P_0, P_1, \ldots, P_t$ of [n] subject to the following constraints:

1. $|P_0| = k - (t - 1)(d - 3) + 1$;

2. For all $i \in [t]$, $|P_i| = r + 1$;

3. For all $i, j \in [t]$ such that $i \neq j$, $P_i \cap P_j = \emptyset$;

4. For all $i \in [t]$, $|P_0 \cap P_i| = r - d + 3$.

For every set $P_i$, $0 \leq i \leq t$, we fix a vector $v_i \in \mathbb{F}_q^n$ such that $\mathrm{Supp}(v_i) = P_i$. We ensure that
non-zero coordinates of $v_0$ contain the same value. We also ensure that for all $i \in [t]$ non-zero
coordinates of $v_i$ contain distinct values. The lower bound on q implies that these conditions can
be met. For a finite set A let $A^{\circ}$ denote a set that is obtained from A by dropping at most one
element. Note that for all $i \in [t]$ and all non-zero $\alpha, \beta$ in $\mathbb{F}_q$ we have

$$\mathrm{Supp}(\alpha v_0 + \beta v_i) = (P_0 \setminus P_i) \sqcup (P_0 \cap P_i)^{\circ} \sqcup (P_i \setminus P_0). \qquad (13)$$






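Consider the space $L = \mathrm{Span}(\{v_i\}_{0 \leq i \leq t})$. Let $M = P_0 \setminus \bigcup_{i=1}^{t} P_i$. Observe that

$$|M| = k - (t - 1)(d - 3) + 1 - t(r - d + 3) = d - 2.$$

By (13) for any $v \in L$ we have

$$\mathrm{Supp}(v) = \bigsqcup_{i \in T} P_i \quad \text{for some } T \subseteq [t]; \text{ OR}$$

$$\mathrm{Supp}(v) = M \sqcup \bigsqcup_{i \in [t] \setminus T} (P_0 \cap P_i) \sqcup \bigsqcup_{i \in T} (P_0 \cap P_i)^{\circ} \sqcup \bigsqcup_{i \in T} (P_i \setminus P_0) \quad \text{for some } T \subseteq [t]. \qquad (14)$$

Observe that a set $K \subseteq [n]$, |K| = k is a k-core for L if and only if for all $i \in [t]$, $P_i \not\subseteq K$ and

$$M \not\subseteq K; \text{ OR}$$

$$M \subseteq K \text{ and } \exists i \in [t] \text{ such that } \begin{cases} |P_i \cap P_0 \cap K| < r - d + 2; \text{ OR} \\ |P_i \cap P_0 \cap K| = r - d + 2 \text{ and } P_i \setminus P_0 \not\subseteq K. \end{cases} \qquad (15)$$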



Let $I \subseteq [n]$ be such that $M \cap I = \emptyset$ and for all $i \in [t]$, $|I \cap P_i| = r$. By (15) I is a k-core for L. We
use lemma 14 to obtain vectors $\{c_i\}_{i \in [n]} \subseteq \mathbb{F}_q^k$ that are in general position subject to the space L.
We choose vectors $\{c_i\}_{i \in I}$ as our basis for $\mathbb{F}_q^k$ and consider the code C defined as in (3).

It is not hard to see that C is a systematic code of information locality r. All t parity symbols
in the set $\left(\bigcup_{i \in [t]} P_i\right) \setminus I$ also have locality r. Furthermore all $d - 2$ parity symbols in the set M
have locality $k - (t - 1)(d - 3)$. It remains to prove that the code C has distance

$$d = n - k - t + 2. \qquad (16)$$

According to Fact 1 the distance of C equals $n - |S|$ where $S \subseteq [n]$ is the largest set such that vectors
$\{c_i\}_{i \in S}$ do not have full rank. By definition 13 for any k-core K of L we have $\mathrm{Rank}\{c_i\}_{i \in K} = k$.
Thus in order to establish (16) it suffices to show that every set $S \subseteq [n]$ of size $k + t - 1$ contains a
k-core of L. Our proof involves case analysis. Let $S \subseteq [n]$, $|S| = k + t - 1$ be an arbitrary set. Set

$$b = \#\{i \in [t] \mid P_i \subseteq S\}.$$

Note that since $t(r + 1) > |S|$ we have $b \leq t - 1$.

Case 1: $M \not\subseteq S$. We drop $t - 1$ elements from S to obtain a set $K \subseteq S$, |K| = k such that for
all $i \in [t]$, $P_i \not\subseteq K$. By (15) K is a k-core.

Case 2: $M \subseteq S$ and $b \leq t - 2$. We drop $t - 1$ elements from S to obtain a set $K \subseteq S$, |K| = k
such that $M \subseteq K$ and for all $i \in [t]$, $P_i \not\subseteq K$. By (15) K is a k-core.

Case 3: $M \subseteq S$ and $b = t - 1$. Let $i \in [t]$ be such that $P_i \not\subseteq S$. Such i is unique. Observe that

$$|P_i \cap S| = k + t - 1 - (d - 2) - (t - 1)(r + 1) = r - d + 2.$$

Also observe that $|P_i \setminus P_0| = r + 1 - (r - d + 3) = d - 2 \geq 1$. Combining the last two observations
we conclude that either

$$|P_i \cap P_0 \cap S| < r - d + 2; \text{ OR}$$

$$|P_i \cap P_0 \cap S| = r - d + 2 \text{ and } P_i \setminus P_0 \not\subseteq S. \qquad (17)$$

Finally, we drop $t - 1$ elements from S to obtain a set $K \subseteq S$, |K| = k such that for all $i \in [t]$,
$P_i \not\subseteq K$. By (15) and (17) K is a k-core. □
Theorem 15 gave a general construction of (r,d)-codes that are optimal not only with respect
to information locality and redundancy but also with respect to locality of parity symbols. That
theorem however is weak in two respects. Firstly, the construction is not explicit. Secondly, the
construction requires a large underlying field. The next theorem gives an explicit construction that
works even over small fields in the narrow case of codes of distance 4.

Theorem 16. Let $r < k$, $r \mid k$ be positive integers. Let $q \geq r + 2$ be a prime power. Let $n = k + \frac{k}{r} + 2$.
There exists a systematic $[n,k,4]_q$ code C of information locality r, where $\frac{k}{r}$ parity symbols have
locality r, and 2 other parity symbols have locality $k - \frac{k}{r} + 1$.

Proof. Fix an arbitrary systematic [r + 3, r, 4]q code £. For instance, one can choose £^ to be a Reed 
Solomon code. Let 

^(y) = (y,Po • y,pi ■ y,P2 • y)- 

Since £" is a MDS code all vectors {pi} have weight r. Thus for some non-zero {cij}jelr] we have 

r-l 

where {ejjjgf^] are the r-dimensional unit vectors. To define a systematic code C we partition the 
input vector x G into t = ^ vectors yi, . . . , yj € F^. We set 

C(x) = (yi, . . . ,yt,po • yi,. . . ,Po • yt, (pi - ^Yi) > (p2 • X]^*)) ' ^^^^ 

where the summation is over all i G [t]. It is not hard to see that the first k + t coordinates of C 
have locality r. We argue that the last two coordinates have locality k — t + 1. From (llSp we have 

r-l 

(pi • J^yi) = J^oj (ej • J^yi) +ar (p2 • J^yi) > 

where the summation is over all i € [t]. Equivalently, 

r-l 

(pi -^yi) =^Y1 + (p2 • yi) • 

J = l ie[t] 

Thus the next-to-last coordinate of C can be recovered from accessing (r — l)t information coordi- 
nates and the last coordinate. Similarly, the last coordinate can be recovered from k — t information 
coordinates and the next-to-last coordinate. To prove that the code C has distance 4 we give an 
algorithm to correct 3 erasures in C. The algorithm has two steps. 

Step 1: For every i G [t], we refer to a subset (yj, Po ■ yj) of r + 1 coordinates of C as a block. 
We go over all t blocks. If we encounter a block where one symbol is erased, we recover this symbol 
from other symbols in the block. 

Step 2: Observe that after the execution of Step 1 there can be at most one block that has erasures, since any such block contains at least two of the at most three erased symbols. If no such block exists, then in Step 1 we have successfully recovered all information symbols and thus we are done. Otherwise, let the unique block with erasures be $(y_j, p_0 \cdot y_j)$ for some $j \in [t]$. Since we know all vectors $\{y_i\}_{i \ne j,\, i \in [t]}$, from $p_1 \cdot \sum_i y_i$ and $p_2 \cdot \sum_i y_i$ (if these symbols are not erased) we recover the symbols $p_1 \cdot y_j$ and $p_2 \cdot y_j$. Finally, we invoke the decoding procedure for the code $\mathcal{E}$ to recover $y_j$ from at most 3 erasures in $\mathcal{E}(y_j) = (y_j,\; p_0 \cdot y_j,\; p_1 \cdot y_j,\; p_2 \cdot y_j)$. □
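The construction in the proof can be checked by brute force on a toy instance. The sketch below is only an illustration under assumptions that are not in the text: the field is the prime field GF(7), the parameters are r = 2 and t = 2 (so k = 4 and n = 8), and the MDS code is obtained from a Cauchy matrix rather than from a Reed-Solomon code. All function names are hypothetical.

```python
from itertools import product

Q = 7          # assumed prime field size, so arithmetic is plain mod Q
R, T = 2, 2    # assumed locality r and number of blocks t
K = R * T      # message length k = r * t

def cauchy_parities(r, q):
    """Return p_0, p_1, p_2 as columns of a Cauchy matrix.

    With entries 1 / (x - y) for distinct points, [I | P] generates an
    MDS code, here a systematic [r + 3, r, 4] code (needs r + 3 <= q).
    """
    xs = list(range(r))            # one point per information coordinate
    ys = list(range(r, r + 3))     # one point per parity, disjoint from xs
    return [[pow((x - y) % q, -1, q) for x in xs] for y in ys]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v)) % Q

def encode(x, p):
    """C(x) = (y_1..y_t, p0.y_1..p0.y_t, p1.sum_i(y_i), p2.sum_i(y_i))."""
    ys = [x[i * R:(i + 1) * R] for i in range(T)]
    total = [sum(col) % Q for col in zip(*ys)]
    light = [dot(p[0], y) for y in ys]
    return list(x) + light + [dot(p[1], total), dot(p[2], total)]

def min_distance(p):
    """Minimum weight over all non-zero messages (the code is linear)."""
    return min(sum(s != 0 for s in encode(x, p))
               for x in product(range(Q), repeat=K) if any(x))

P = cauchy_parities(R, Q)
N = len(encode((0,) * K, P))
D = min_distance(P)
print(N, K, D)
```

For this toy instance the exhaustive search over all 7^4 - 1 non-zero messages agrees with the claimed distance of 4 at length n = k + t + 2.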

6 Non-Canonical Codes 

In this section we observe that the canonical codes detailed in sections 4 and 5 are not the only family of optimal $(r, d)$-codes. If one relaxes the conditions of theorem 9, one can get other families. One such family, which yields uniform locality for all symbols, is given below. The (non-explicit) proof resembles the proof of theorem 15, although it is much simpler.

Theorem 17. Let $n$, $k$, $r$, and $d \ge 2$ be positive integers. Let $q > kn^k$ be a prime power. Suppose $(r + 1) \mid n$ and

$$n - k = \left\lceil \frac{k}{r} \right\rceil + d - 2;$$

then there exists an $[n,k,d]_q$ code where all symbols have locality $r$.

Proof. Let $t = \frac{n}{r+1}$. We partition the set $[n]$ into $t$ subsets $P_1, \ldots, P_t$, each of size $r + 1$. For every $i \in [t]$ we fix a vector $v_i \in \mathbb{F}_q^n$ such that $\mathrm{Supp}(v_i) = P_i$. We set all non-zero coordinates in the vectors $\{v_i\}$ to be equal to 1. We consider the linear space $L = \mathrm{Span}(\{v_i\}_{i \in [t]})$. For every $v \in L$ we have

$$\mathrm{Supp}(v) = \bigsqcup_{i \in T} P_i \quad \text{for some } T \subseteq [t].$$

Observe that a set $K \subseteq [n]$, $|K| = k$, is a $k$-core for $L$ if and only if for all $i \in [t]$, $P_i \not\subseteq K$. Also observe that the conditions of the theorem imply $k \le n - t$. Therefore $k$-cores for $L$ exist. We use lemma 14 to obtain vectors $\{c_i\}_{i \in [n]} \subseteq \mathbb{F}_q^k$ that are in general position subject to the space $L$. We consider the code $C$ defined as in (3). It is not hard to see that $C$ has dimension $k$ and locality $r$ for all symbols. It remains to prove that the code $C$ has distance

$$d = n - k - \left\lceil \frac{k}{r} \right\rceil + 2. \qquad (20)$$



Our proof relies on Fact 1. Let $S \subseteq [n]$ be an arbitrary subset such that $\mathrm{Rank}\{c_i\}_{i \in S} < k$. Clearly, no $k$-core of $L$ is in $S$. Let

$$b = \#\{i \in [t] \mid P_i \subseteq S\}.$$

We have $|S| - b \le k - 1$, since dropping $b$ elements from $S$ (one from each $P_i \subseteq S$) turns $S$ into an $(|S| - b)$-core. We also have $br \le k - 1$, since dropping one element from each $P_i \subseteq S$ gives us a $br$-core in $S$. Combining the last two inequalities we conclude that

$$|S| \le k + \left\lfloor \frac{k-1}{r} \right\rfloor - 1.$$

Combining this inequality with the identity $\left\lfloor \frac{k-1}{r} \right\rfloor = \left\lceil \frac{k}{r} \right\rceil - 1$ and Fact 1 we obtain (20). □
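The arithmetic in the last step is easy to check mechanically. The following sketch (helper names are hypothetical, and the sample parameters n = 12, k = 6, r = 2 are chosen only for illustration) verifies the floor/ceiling identity over a range of values and confirms that subtracting the bound on $|S|$ from $n$ gives the distance in (20).

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)

def optimal_distance(n, k, r):
    """Distance from (20): d = n - k - ceil(k / r) + 2."""
    return n - k - ceil_div(k, r) + 2

def max_deficient_size(k, r):
    """Bound on |S| from the proof: k + floor((k - 1) / r) - 1."""
    return k + (k - 1) // r - 1

# The identity floor((k - 1) / r) = ceil(k / r) - 1 for positive integers.
identity_holds = all((k - 1) // r == ceil_div(k, r) - 1
                     for k in range(1, 200) for r in range(1, 50))

# Example with (r + 1) | n: n = 12, k = 6, r = 2.
n, k, r = 12, 6, 2
d_from_formula = optimal_distance(n, k, r)
d_from_bound = n - max_deficient_size(k, r)
```

In this example both computations give the same value, matching the way the proof passes from the bound on $|S|$ to (20).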
7 Beyond Worst-Case Distance



In this section, codes are assumed to be systematic unless otherwise stated. They will have k 
information symbols and h = n — k parity check symbols. 

7.1 Generalized Pyramid Codes 

The supports of the parity check symbols in a code can be described using a bipartite graph. More 
generally, we define the notion of a set of points with supports matching a graph G. 

Definition 18. Let $G([k], [h], E)$ be a bipartite graph. We say that $c_1, \ldots, c_h \in \mathbb{F}_q^k$ have supports matching $G$ if $\mathrm{Supp}(c_j) = \Gamma(j)$ for all $j \in [h]$, where $\Gamma(j)$ denotes the neighborhood of $j$ in $G$.

Given points $c_1, \ldots, c_h$, consider the $k \times h$ matrix $C$ with columns $c_1, \ldots, c_h$. For $I \subseteq [k]$ and $J \subseteq [h]$, let $C_{I,J}$ denote the sub-matrix of $C$ with rows indexed by $I$ and columns indexed by $J$.

Definition 19. Points $c_1, \ldots, c_h \in \mathbb{F}_q^k$ with supports matching $G$ are in general position if for every $I \subseteq [k]$ and $J \subseteq [h]$ such that there is a perfect matching from $I$ to $J$ in $G$, the sub-matrix $C_{I,J}$ is invertible.

Standard arguments show that over sufficiently large fields $\mathbb{F}_q$, choosing $c_1, \ldots, c_h$ randomly from the set of vectors with supports matching $G$ gives points in general position.
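For very small instances the general position property can be tested exhaustively. The sketch below is an illustration only: it assumes a prime field (so arithmetic is mod q), takes G to be the support graph of the given vectors, and uses brute-force enumeration of permutations both for the matching test and for the determinant. The function names and the sample vectors are hypothetical.

```python
from itertools import combinations, permutations

def det_mod(M, q):
    """Determinant mod q via the Leibniz formula (tiny matrices only)."""
    n = len(M)
    total = 0
    for perm in permutations(range(n)):
        inversions = sum(perm[a] > perm[b]
                         for a in range(n) for b in range(a + 1, n))
        term = -1 if inversions % 2 else 1
        for a in range(n):
            term *= M[a][perm[a]]
        total += term
    return total % q

def has_perfect_matching(I, J, supports):
    """True if some bijection pairs each i in I with a j in J, i in Gamma(j)."""
    return any(all(I[a] in supports[J[perm[a]]] for a in range(len(I)))
               for perm in permutations(range(len(J))))

def in_general_position(cols, q):
    """Check Definition 19 for columns c_1..c_h over the prime field F_q."""
    k, h = len(cols[0]), len(cols)
    supports = [{i for i, v in enumerate(c) if v % q} for c in cols]
    for size in range(1, min(k, h) + 1):
        for I in combinations(range(k), size):
            for J in combinations(range(h), size):
                if has_perfect_matching(I, J, supports):
                    sub = [[cols[j][i] % q for j in J] for i in I]
                    if det_mod(sub, q) == 0:
                        return False
    return True

good = in_general_position([(1, 1, 1), (1, 2, 3)], 5)
bad = in_general_position([(1, 1, 1), (1, 1, 2)], 5)
sparse = in_general_position([(1, 1, 0), (0, 1, 1)], 5)
```

The second example fails because rows 0 and 1 of the two columns are proportional, so a matched 2 x 2 sub-matrix is singular. The enumeration is exponential and is meant only for sanity checks on toy parameters.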

Coming back to codes, the supports of the parity checks define a bipartite graph which we will 
call the support graph. This is closely related to but distinct from the Tanner graph. 

Definition 20. Let $C$ be a systematic code with point set $C = \{e_1, \ldots, e_k, c_1, \ldots, c_h\}$. The support graph $G([k], [h], E)$ of $C$ is a bipartite graph where $(i, j) \in E$ if $i \in \mathrm{Supp}(c_j)$.

For instance, in any canonical $(r, d)$-code, the support graph is specified up to relabeling. There are $\frac{k}{r}$ vertices in $V$ of degree $r$ corresponding to $C' \subseteq C$, whose neighborhoods partition the set $U$, and $d - 2$ vertices of degree $k$ corresponding to $C'' \subseteq C$. The minimum distance of such a code is exactly $d$, and hence there are some patterns of $d$ erasures that the code cannot correct. However, it is possible that the code could correct many patterns of erasures of weight $d$ and higher, for a suitable choice of the $c_j$'s. In general one could ask: among all codes with a support graph $G$, which codes can correct the most erasure patterns? A priori, it is unclear that there should be a single code that is optimal in the sense that it corrects the maximal possible set of patterns. As shown by [5], such codes do exist over sufficiently large fields.

Consider a systematic code $C$ with support graph $G([k], [h], E)$. Given $I \subseteq [k]$ and $J \subseteq [h]$, let $\Gamma_J(I)$ denote $\Gamma(I) \cap J$ (define $\Gamma_I(J)$ similarly). Consider a set of erasures $S \cup T$ where $S \subseteq [k]$ and $T \subseteq [h]$ are the sets of information and parity check symbols respectively that are erased. To correct these erasures, we need to recover the symbols corresponding to $\{e_i : i \in S\}$ from the parity checks corresponding to $\{c_j : j \in \bar{T} = [h] \setminus T\}$. For this to be possible, a necessary condition is that for every $S' \subseteq S$, $|\Gamma_{\bar{T}}(S')| \ge |S'|$. By Hall's theorem, this is equivalent to the existence of a matching in $G$ from $S$ to $\bar{T}$. We say that such a set of erasures satisfies Hall's condition.
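Hall's condition for a given erasure pattern can be tested directly from the support graph. The sketch below is a brute-force illustration; the function name, the 0-based indexing, and the small pyramid-style graph (two local parities and one global parity on k = 4 information symbols) are assumptions made for the example.

```python
from itertools import combinations

def satisfies_hall(gamma, S, T):
    """Check |Gamma_{T-bar}(S')| >= |S'| for every non-empty S' in S.

    gamma[j] is the neighborhood of parity j (a set of information indices),
    S is the set of erased information symbols, T the set of erased parities.
    """
    surviving = [j for j in range(len(gamma)) if j not in T]
    erased = sorted(S)
    for size in range(1, len(erased) + 1):
        for subset in combinations(erased, size):
            sub = set(subset)
            neighbors = [j for j in surviving if gamma[j] & sub]
            if len(neighbors) < size:
                return False
    return True

# Hypothetical support graph: parities 0 and 1 are local, parity 2 is global.
GAMMA = [{0, 1}, {2, 3}, {0, 1, 2, 3}]

two_in_one_group = satisfies_hall(GAMMA, {0, 1}, set())
lost_global_too = satisfies_hall(GAMMA, {0, 1}, {2})
three_info = satisfies_hall(GAMMA, {0, 1, 2}, set())
four_info = satisfies_hall(GAMMA, {0, 1, 2, 3}, set())
```

In this example, losing both symbols of one local group is admissible while the global parity survives, but not once the global parity is erased as well. Four information erasures can never be admissible with only three parities.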

Definition 21. A systematic code C with support graph G is a generalized pyramid code if every 
set of erasures satisfying Hall's condition can be corrected. 
We can rephrase this definition in algebraic terms using the notion of points with specified supports in general position.

Theorem 22. Let $C$ be a systematic code with support graph $G$. $C$ is a generalized pyramid code iff $c_1, \ldots, c_h$ are in general position with supports matching $G$.

7.2 The Tradeoff between Locality and Erasure Correction

For any parity check symbol $c_j$, it is clear that $\mathrm{Loc}(c_j) \le \mathrm{wt}(c_j) = \deg(j)$. We will show that no better locality is possible for a generalized pyramid code. This result relies on a characterization of the supports of the vectors in the space $V$ spanned by $\{c_1, \ldots, c_h\}$ in terms of the graph $G$.

Let $V$ denote the space spanned by $\{c_1, \ldots, c_h\}$, which are in general position with supports matching $G$. Let $\mathrm{Supp}(V) \subseteq 2^{[k]}$ denote the set of supports of vectors in $V$. We give a necessary condition for membership in $\mathrm{Supp}(V)$. Our condition is in terms of sets of coordinates that can be eliminated by a combination of certain $c_j$'s.

Definition 23. Let $c = \sum_{j \in J} \mu_j c_j$ where $\mu_j \ne 0$. Let $I = \bigcup_{j \in J} \Gamma(j) \setminus \mathrm{Supp}(c)$. We say that the set $I$ has been eliminated from $\bigcup_{j \in J} \Gamma(j)$.

Theorem 24. Let $\{c_1, \ldots, c_h\}$ be vectors with supports matching $G$ in general position. The set $I$ can be eliminated from $\bigcup_{j \in J} \Gamma(j)$ only if $|\Gamma_J(I')| > |I'|$ for every non-empty $I' \subseteq I$.

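Proof. Let $c = \sum_{j \in J} \mu_j c_j$ where $\mu_j \ne 0$. Let $I = \bigcup_{j \in J} \Gamma(j) \setminus \mathrm{Supp}(c)$. Assume for contradiction that there exists a non-empty $I_0 \subseteq I$ where $|\Gamma_J(I_0)| \le |I_0|$. We will show that there exists $I' \subseteq I$ so that $|\Gamma_J(I')| = |I'|$ and that $|\Gamma_J(I'')| > |I''|$ for every non-empty proper subset $I'' \subsetneq I'$.

It suffices to prove the existence of $I' \subseteq I$ where $|\Gamma_J(I')| = |I'|$; the claim about subsets of $I'$ will then follow by taking a minimal such $I'$ (a proper subset violating the claim would, by the same argument, contain a smaller set with equality). Observe that every $i \in I$ must have $|\Gamma_J(i)| \ge 2$, since if $i$ occurs in exactly one $c_j$, then it cannot be eliminated. Hence we must have $|I_0| \ge 2$.

One can construct the set $I_0$ starting from a singleton and adding elements one at a time, giving a sequence $I_1, \ldots, I_\ell = I_0$. We claim that for any $2 \le i \le \ell$,

$$|\Gamma_J(I_i)| - |I_i| \ge |\Gamma_J(I_{i-1})| - |I_{i-1}| - 1.$$

This holds since $|\Gamma_J(I_i)|$ can only increase on adding an element to $I_{i-1}$, while $|I_i|$ increases by 1. Since $|\Gamma_J(I_1)| - |I_1| \ge 1$ whereas $|\Gamma_J(I_\ell)| - |I_\ell| \le 0$, we must have $|\Gamma_J(I_i)| - |I_i| = 0$ for some $i \le \ell$. Thus we have a set $I' = I_i$ where $|\Gamma_J(I')| = |I'|$, as desired.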

Since the set $I'$ satisfies Hall's matching condition, there is a perfect matching from $I'$ to $J' = \Gamma_J(I')$ in $G$. But this means that the sub-matrix $C_{I',J'}$ has full rank. On the other hand, $c = \sum_{j \in J} \mu_j c_j$ and $I' \subseteq I = \bigcup_{j \in J} \Gamma(j) \setminus \mathrm{Supp}(c)$. Let $\pi(c)$ denote the restriction of $c$ onto the co-ordinates in $I'$. Then we have

$$\sum_{j \in J'} \mu_j \pi(c_j) = \sum_{j \in J} \mu_j \pi(c_j) = \pi(c) = 0.$$

The first equality holds because $\pi(c_j) = 0$ for $j \notin \Gamma(I')$, the second by linearity of $\pi$, and the last since $I'$ is disjoint from $\mathrm{Supp}(c)$. Hence the non-zero vector $\mu' = (\mu_j)_{j \in J'}$ lies in the kernel of $C_{I',J'}$, which contradicts the assumption that it has full rank.

This shows that the condition $|\Gamma_J(I')| > |I'|$ for all non-empty $I' \subseteq I$ is necessary. □
Corollary 25. If the set $I$ can be eliminated from $\bigcup_{j \in J} \Gamma(j)$, then $|I| \le |J| - 1$.
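The corollary can be observed on a toy example by enumerating all linear combinations with non-zero coefficients. The sketch below assumes the prime field GF(5) and the two hypothetical vectors (1, 1, 1) and (1, 2, 3), which have full supports and all matched minors non-zero, so they are in general position in the sense of Definition 19. Names are illustrative only.

```python
from itertools import combinations, product

Q = 5                                  # assumed prime field size
VECTORS = [(1, 1, 1), (1, 2, 3)]       # hypothetical general-position vectors

def eliminated(J, mu):
    """Coordinates of the union of supports that vanish in sum(mu_j * c_j)."""
    length = len(VECTORS[0])
    union = {i for j in J for i in range(length) if VECTORS[j][i] % Q}
    combo = [sum(m * VECTORS[j][i] for j, m in zip(J, mu)) % Q
             for i in range(length)]
    support = {i for i, v in enumerate(combo) if v}
    return union - support

def max_eliminated(size):
    """Largest eliminated set over all J of the given size and non-zero mu."""
    return max(len(eliminated(J, mu))
               for J in combinations(range(len(VECTORS)), size)
               for mu in product(range(1, Q), repeat=size))

one_vector = max_eliminated(1)
two_vectors = max_eliminated(2)
```

With a single vector nothing can be eliminated, and with two vectors at most one coordinate can be, in line with the bound $|I| \le |J| - 1$.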

If the field size $q$ is sufficiently large, the necessary condition given by theorem 24 is also sufficient. We defer the proof of this statement to Appendix A and now prove our lower bound on the locality of generalized pyramid codes.

Theorem 26. In a generalized pyramid code, $\mathrm{Loc}(c_j) = \deg(j)$ for all $j \in [h]$.

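Proof. Assume for contradiction that $\mathrm{Loc}(c_t) \le \deg(t) - 1$ for some $t \in [h]$. Hence there exist $A \subseteq [k]$ and $B \subseteq [h]$ (not containing $t$) with $|A| = a$, $|B| = b$ and $a + b \le \deg(t) - 1$ such that

$$c_t = \sum_{i \in A} \lambda_i e_i + \sum_{j \in B} \mu_j c_j$$

for some coefficients $\lambda_i$ and non-zero $\mu_j$ in $\mathbb{F}_q$. Thus we have eliminated at least $\deg(t) - a \ge b + 1$ indices from $\bigcup_{j \in B \cup \{t\}} \mathrm{Supp}(c_j)$ using a linear combination of $b + 1$ vectors. By corollary 25 this is not possible for vectors in general position. □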
A Spaces spanned by general position vectors

Lemma 27. Let $q \ge n$ be a prime power. The set of supports of vectors in any linear space $V \subseteq \mathbb{F}_q^n$ is closed under union.

Proof. Consider two vectors $a$ and $b$ in $V$ with $\mathrm{Supp}(a) = S$ and $\mathrm{Supp}(b) = T$. We may assume that $|S|, |T| \le n - 1$ and that one set does not contain the other. Now consider $a + \lambda b$ for $\lambda \in \mathbb{F}_q^*$. It suffices to find $\lambda$ such that $a(i) + \lambda b(i) \ne 0$ for each $i \in S \cap T$. This rules out at most $|S \cap T| \le n - 2$ values of $\lambda$, so there is a solution provided $q - 1 > n - 2$, that is, $q \ge n$. □

It is easy to see that the condition $q \ge n$ is tight by considering the length 3 parity check code over $\mathbb{F}_2$, where the set of supports is not closed under union.
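The tightness example can be confirmed by enumeration. The sketch below lists the supports of the zero-sum (parity check) code of length 3 over a prime field and tests closure under union. The function names are hypothetical, and the comparison with $\mathbb{F}_3$ is added only to illustrate the case $q \ge n$.

```python
from itertools import product

def supports(q, n):
    """Supports of the code {x in F_q^n : x_1 + ... + x_n = 0}, q prime."""
    return {frozenset(i for i, v in enumerate(x) if v)
            for x in product(range(q), repeat=n) if sum(x) % q == 0}

def closed_under_union(family):
    """True if the union of any two members is again a member."""
    return all((a | b) in family for a in family for b in family)

binary_closed = closed_under_union(supports(2, 3))    # q = 2 < n = 3
ternary_closed = closed_under_union(supports(3, 3))   # q = 3 >= n = 3
```

Over $\mathbb{F}_2$ the supports are the empty set and the three pairs, so the union of two distinct pairs is missing. Over $\mathbb{F}_3$ the all-ones vector lies in the code and the family is closed, as Lemma 27 predicts.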

Theorem 28. Let $q \ge n$. Let $\{c_1, \ldots, c_h\}$ be vectors with supports matching $G$ in general position which span a space $V$. $\mathrm{Supp}(V)$ consists of all sets of the form $\bigcup_{j \in J} \Gamma(j) \setminus I$ where $I$ satisfies the condition $|\Gamma_J(I')| > |I'|$ for every non-empty $I' \subseteq I$.

Proof. Theorem 24 shows that the condition on $I$ is necessary; we now show that it is sufficient. For $j \notin \Gamma(I)$ the sets $\Gamma(j)$ and $I$ are disjoint. Hence we can write

$$\bigcup_{j \in J} \Gamma(j) \setminus I = \Big(\bigcup_{j \in J \cap \Gamma(I)} \Gamma(j) \setminus I\Big) \cup \Big(\bigcup_{j \in J \setminus \Gamma(I)} \Gamma(j)\Big).$$

By the closure under union, it suffices to prove the statement in the case when $J \subseteq \Gamma(I)$. Fix $j_0 \in J$ and let $J' = J \setminus \{j_0\}$. Since $|\Gamma_J(I')| > |I'|$ we have $|\Gamma_{J'}(I')| \ge |I'|$ for every $I' \subseteq I$. So there is a matching from $I$ to some subset $J'' \subseteq J'$ where $|J''| = |I|$, and the matrix $C_{I,J''}$ is of full rank since the $c_j$'s are in general position.

Let $\pi(c)$ denote the restriction of a vector $c$ onto the coordinates in $I$. Since $C_{I,J''}$ is invertible, the vectors $\{\pi(c_j)\}_{j \in J''}$ have full rank. Note that $\pi(c_{j_0})$ is not a zero vector since $j_0 \in \Gamma(I)$. So there exist $\{\mu_j\}$ for $j \in J''$ which are not all 0 and

$$\pi(c_{j_0}) = \sum_{j \in J''} \mu_j \pi(c_j).$$
Now consider the vector $c'_{j_0} = c_{j_0} - \sum_{j \in J''} \mu_j c_j$. Note that $\pi(c'_{j_0})$ is a zero vector, which shows that $\mathrm{Supp}(c'_{j_0}) \subseteq \bigcup_{j \in J'' \cup \{j_0\}} \Gamma(j) \setminus I$. We will show that equality holds by using corollary 25. Since we have eliminated $|I|$ coordinates, the linear combination must involve at least $|I| + 1$ vectors, which means that $\mu_j \ne 0$ for all $j$. Further, the set of eliminated co-ordinates cannot be larger than $I$, since this would violate corollary 25. Hence we have

$$\mathrm{Supp}(c'_{j_0}) = \bigcup_{j \in J'' \cup \{j_0\}} \Gamma(j) \setminus I. \qquad (21)$$

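By repeating this argument for every $j_0 \in J$, we will be able to find $J(j_0) \subseteq J$ of size $|I| + 1$ which contains $j_0$ and a vector $c'_{j_0}$ such that

$$\mathrm{Supp}(c'_{j_0}) = \bigcup_{j \in J(j_0)} \Gamma(j) \setminus I.$$

Using the closure under union of supports, we conclude that $\mathrm{Supp}(V)$ contains the set

$$\bigcup_{j_0 \in J} \bigcup_{j \in J(j_0)} \Gamma(j) \setminus I = \bigcup_{j \in J} \Gamma(j) \setminus I.$$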

□ 